feat(datafusion): implement the project node to add the partition columns #1602
Conversation
Force-pushed from b3a8601 to 40a225a
…umns defined in Iceberg. Implement physical execution plan node that projects Iceberg partition columns from source data, supporting nested fields and all Iceberg transforms.
Force-pushed from 40a225a to 4d59f87
let field_path = Self::find_field_path(&self.table_schema, source_field.id)?;
let index_path = Self::resolve_arrow_index_path(batch_schema.as_ref(), &field_path)?;

let source_column = Self::extract_column_by_index_path(batch, &index_path)?;
This looks very interesting! I actually came across a similar issue when implementing the sort node, and I was leaning toward implementing a new SchemaWithPartnerVisitor, wdyt?
Perfect 👌
I was initially thinking this was needed just for this implementation, but it seems the right place would be closer to the Schema definition. Since this is a standard method for accessing column values by index, it makes sense to generalize!
I drafted a PartitionValueVisitor here to help extract partition values from a record batch in tree-traversal style. Please let me know what you think!
I just saw this implementation to extract partition values, and it actually makes more sense to me because it leverages the existing RecordBatchProjector: #1040
Good, thanks for sharing. I will use #1040 when merged!
Hey @CTTY 👋,
I can use it now, but I have one concern about leveraging RecordBatchPartitionSplitter: it relies on PARQUET_FIELD_ID_META_KEY. Since DataFusion doesn't use this key, do you think we should adapt this method to make it compatible with DataFusion?
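For context, a minimal sketch of what that metadata looks like on an Arrow field, assuming the arrow_schema crate (the helper name is made up): Iceberg's Parquet reader attaches "PARQUET:field_id" to each field's metadata, which is exactly what a schema built by an arbitrary DataFusion plan generally lacks.

use std::collections::HashMap;

use arrow_schema::{DataType, Field};

// Hypothetical helper mimicking what Iceberg's Parquet reader produces: an
// Arrow field whose metadata carries the Iceberg field id under
// "PARQUET:field_id" (PARQUET_FIELD_ID_META_KEY). Schemas coming out of an
// arbitrary DataFusion plan typically do not carry this entry.
fn field_with_iceberg_id(name: &str, data_type: DataType, field_id: i32) -> Field {
    Field::new(name, data_type, true).with_metadata(HashMap::from([(
        "PARQUET:field_id".to_string(),
        field_id.to_string(),
    )]))
}

fn main() {
    let field = field_with_iceberg_id("city", DataType::Utf8, 12);
    assert_eq!(
        field.metadata().get("PARQUET:field_id").map(String::as_str),
        Some("12")
    );
}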
}

/// Find the path to a field by its ID (e.g., ["address", "city"]) in the Iceberg schema
fn find_field_path(table_schema: &Schema, field_id: i32) -> DFResult<Vec<String>> {
…n containing all the partitions values
Force-pushed from 7f8404f to 7be558d
Force-pushed from 57fe2dd to bc805db
Hi @fvaleye, is this PR ready for review, or does it still need some work?
Hi @liurenjie1024 👋, yes, it's ready for review. However, it might require additional refactoring if we want to make these utility functions more general.
Thanks @fvaleye for this PR! I left some comments to improve it, and I still have other questions:
- What's the entry point of this module?
- Could the entry point of this module be a function like the one sketched below?
fn project_with_partition(input: &ExecutionPlan, table: &Table) -> Result<Arc<dyn ExecutionPlan>> {
    // This method extends `input` with an extra `PhysicalExpr`, which calculates the partition value.
    ...
}
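A rough sketch of how such an entry point could look, assuming DataFusion's ProjectionExec and Column; the partition expression is treated as an already-built PhysicalExpr, and the "_partition" output name is only illustrative.

use std::sync::Arc;

use datafusion::common::Result as DFResult;
use datafusion::physical_expr::expressions::Column;
use datafusion::physical_plan::projection::ProjectionExec;
use datafusion::physical_plan::{ExecutionPlan, PhysicalExpr};

// Keep every input column as-is and append one extra expression that computes
// the Iceberg partition value. `partition_expr` stands in for the partition
// expression this PR builds; how it is derived from the table metadata is out
// of scope here.
fn project_with_partition(
    input: Arc<dyn ExecutionPlan>,
    partition_expr: Arc<dyn PhysicalExpr>,
) -> DFResult<Arc<dyn ExecutionPlan>> {
    let mut exprs: Vec<(Arc<dyn PhysicalExpr>, String)> = input
        .schema()
        .fields()
        .iter()
        .enumerate()
        .map(|(i, field)| {
            (
                Arc::new(Column::new(field.name(), i)) as Arc<dyn PhysicalExpr>,
                field.name().clone(),
            )
        })
        .collect();
    exprs.push((partition_expr, "_partition".to_string()));
    Ok(Arc::new(ProjectionExec::try_new(exprs, input)?))
}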
/// Extract a column from a record batch by following an index path.
/// The index path specifies the column indices to traverse for nested structures.
#[allow(dead_code)]
fn extract_column_by_index_path(batch: &RecordBatch, index_path: &[usize]) -> DFResult<ArrayRef> {
Could we reuse RecordBatchProjector?
I tried, but I kept this implementation; the main reasons are below:
1. Metadata dependency:
- RecordBatchProjector depends on Arrow field metadata containing PARQUET:field_id.
- This metadata is added when reading Parquet files through Iceberg's reader.
- DataFusion ExecutionPlans might not always have this metadata preserved.
2. Using the Iceberg table's schema directly:
- We resolve field paths using field names, not IDs.
- This works regardless of whether Arrow metadata is present.
Depending on what you think:
- We could keep this implementation working with DataFusion (a sketch of the name-based lookup follows below).
- We could readapt RecordBatchProjector, but it feels like it's not the same intent.
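To make point 2 concrete, here is a small self-contained sketch of name-based path resolution; the helper name is hypothetical, and it returns None instead of a DataFusion error to stay standalone.

use arrow_schema::{DataType, Schema};

// Turn a field path like ["address", "city"] into the column indices to
// follow through nested structs, without relying on any field-id metadata.
fn resolve_index_path(schema: &Schema, field_path: &[String]) -> Option<Vec<usize>> {
    let mut indices = Vec::with_capacity(field_path.len());
    // Top-level lookup on the input schema.
    let (first_idx, mut field) = schema.column_with_name(field_path.first()?)?;
    indices.push(first_idx);
    // Descend into nested structs by name for the remaining path segments.
    for name in &field_path[1..] {
        match field.data_type() {
            DataType::Struct(children) => {
                let idx = children.iter().position(|child| child.name() == name)?;
                indices.push(idx);
                field = children[idx].as_ref();
            }
            _ => return None,
        }
    }
    Some(indices)
}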
I'm not convinced. There are two ways to solve your issue:
- Add a constructor in RecordBatchProjector that accepts the Iceberg schema and the target field IDs.
- Convert the Iceberg schema to an Arrow schema; the converter will add the field_id metadata.
Personally I prefer approach 1, but I don't have a strong opinion about it. After using RecordBatchProjector, the whole PR could be simplified a lot.
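A sketch of option 2, assuming iceberg's schema_to_arrow_schema converter; option 1 is the shape that shows up later in this thread as RecordBatchProjector::from_iceberg_schema_mapping.

use iceberg::arrow::schema_to_arrow_schema;
use iceberg::spec::Schema;

// Converting the Iceberg schema to Arrow attaches the "PARQUET:field_id"
// metadata that RecordBatchProjector relies on, so the projector can be built
// against this converted schema instead of the raw DataFusion input schema.
fn arrow_schema_with_field_ids(table_schema: &Schema) -> iceberg::Result<arrow_schema::Schema> {
    schema_to_arrow_schema(table_schema)
}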
…use PhysicalExpr for partitions values calculation. Signed-off-by: Florian Valeye <[email protected]>
let field_path = find_field_path(&self.table_schema, source_field.id)?;
let index_path = resolve_arrow_index_path(batch_schema.as_ref(), &field_path)?;
We don't need to do this for every batch.
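In other words (an illustrative sketch, not the PR's exact struct): the index paths depend only on the table schema and the input schema, so they can be resolved once when the node is constructed and merely read per batch.

// One resolved index path per partition field, built in the constructor
// (e.g. via the PR's find_field_path / resolve_arrow_index_path helpers).
struct PartitionProjectionState {
    index_paths: Vec<Vec<usize>>,
}

impl PartitionProjectionState {
    fn index_path(&self, partition_field_idx: usize) -> &[usize] {
        &self.index_paths[partition_field_idx]
    }
}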
let partition_value = transform_fn
    .transform(source_column)
    .map_err(to_datafusion_error)?;
let transform_fn = iceberg::transform::create_transform_function(&pf.transform)
Ditto, this only needs to be done once.
This is not resolved; we could create the transform functions in the constructor.
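Something along these lines could live in the constructor; a sketch assuming PartitionSpec::fields() and the accessor names visible in this diff (pf.transform, pf.source_id).

use iceberg::spec::PartitionSpec;
use iceberg::transform::{create_transform_function, BoxedTransformFunction};

// Build one transform function per partition field when the node is
// constructed, instead of calling `create_transform_function` for every
// record batch.
fn build_transform_functions(
    partition_spec: &PartitionSpec,
) -> iceberg::Result<Vec<BoxedTransformFunction>> {
    partition_spec
        .fields()
        .iter()
        .map(|pf| create_transform_function(&pf.transform))
        .collect()
}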
Signed-off-by: Florian Valeye <[email protected]>
    .map(|pf| pf.source_id)
    .collect();

let projector = RecordBatchProjector::from_iceberg_schema_mapping(
We don't need the first batch to get the input's schema, see: https://github.com/apache/datafusion/blob/921f4a028409f71b68bed7d05a348255bb6f0fba/datafusion/physical-plan/src/execution_plan.rs#L106
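For illustration, the schema is available straight from the child plan at construction time through the standard ExecutionPlan API, so the projector can be built up front rather than lazily from the first batch.

use std::sync::Arc;

use arrow_schema::SchemaRef;
use datafusion::physical_plan::ExecutionPlan;

// The input schema comes from the child plan itself, before any batch is read.
fn input_schema(input: &Arc<dyn ExecutionPlan>) -> SchemaRef {
    input.schema()
}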
…ation Signed-off-by: Florian Valeye <[email protected]>
Force-pushed from edb4719 to d4fd336
mod reader;
pub(crate) mod record_batch_projector;
/// RecordBatch projection utilities
pub mod record_batch_projector;
Why do we need to make this pub?
impl std::hash::Hash for PartitionExpr {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        std::any::TypeId::of::<Self>().hash(state);
This is a little odd; why not derive Hash for PartitionValueCalculator?
I refactored to use this approach for both Hash and PartialEq.
For now, we can't derive Hash for PartitionValueCalculator because it contains:
- partition_type: DataType (Arrow's DataType does not implement Hash)
- projector: RecordBatchProjector (does not implement Hash)
- transform_functions: Vec<BoxedTransformFunction> (trait objects cannot be hashed)
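For readers following along, the general pattern looks like this, using an illustrative stand-in type rather than the PR's actual struct: hash a stable discriminant such as the TypeId plus the hashable members, and compare only the comparable members in PartialEq.

use std::hash::{Hash, Hasher};

// Stand-in type: one member is hashable and comparable, the other (a boxed
// closure) implements neither Hash nor PartialEq.
struct ExampleExpr {
    name: String,
    opaque: Box<dyn Fn(i64) -> i64>,
}

impl PartialEq for ExampleExpr {
    fn eq(&self, other: &Self) -> bool {
        // The opaque member is deliberately ignored.
        self.name == other.name
    }
}

impl Hash for ExampleExpr {
    fn hash<H: Hasher>(&self, state: &mut H) {
        std::any::TypeId::of::<Self>().hash(state);
        self.name.hash(state);
    }
}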
let input_schema = input.schema();
let partition_type = build_partition_type(partition_spec, table_schema.as_ref())?;
let calculator = PartitionValueCalculator::new(
This implicitly assumes that the input_schema exactly matches the Iceberg table schema. I think this assumption is valid for now, but we should add a check here to ensure that.
Yes, I can add a method for this.
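A possible shape for that check, sketched under the assumption that iceberg's schema_to_arrow_schema converter is available: compare field names and types, and deliberately ignore metadata, since DataFusion inputs usually do not carry the field-id entries.

use arrow_schema::Schema as ArrowSchema;
use datafusion::common::{DataFusionError, Result as DFResult};
use iceberg::arrow::schema_to_arrow_schema;
use iceberg::spec::Schema;

// Convert the Iceberg schema to Arrow and compare field names and types with
// the input plan's schema.
fn check_input_matches_table(input_schema: &ArrowSchema, table_schema: &Schema) -> DFResult<()> {
    let expected = schema_to_arrow_schema(table_schema)
        .map_err(|e| DataFusionError::External(Box::new(e)))?;
    if expected.fields().len() != input_schema.fields().len() {
        return Err(DataFusionError::Plan(
            "input schema does not match the Iceberg table schema".to_string(),
        ));
    }
    for (expected_field, actual_field) in expected.fields().iter().zip(input_schema.fields().iter()) {
        if expected_field.name() != actual_field.name()
            || expected_field.data_type() != actual_field.data_type()
        {
            return Err(DataFusionError::Plan(format!(
                "input column `{}` does not match the Iceberg table schema",
                actual_field.name()
            )));
        }
    }
    Ok(())
}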
…e schemas Signed-off-by: Florian Valeye <[email protected]>
Which issue does this PR close?
What changes are included in this PR?
Implement a physical execution plan node that projects Iceberg partition columns from source data, supporting nested fields and all Iceberg transforms.
Are these changes tested?
Yes, with unit tests